Speech recognition system with maximum entropy language models

ABSTRACT

The invention relates to a method of setting a free parameter λ_(α)^(ortho) of an attribute in a maximum-entropy speech model, which free parameter could not be set previously with the help of a training algorithm. It is an object of the invention to provide a speech recognition system 100, a training device 10 and a method of setting such a parameter λ_(α)^(ortho) that has a number of possible interpretations. This object is achieved in accordance with the invention in that λ_(α)^(ortho) is calculated as follows: $\lambda_{\alpha}^{ortho} = \log\left( \frac{m_{\alpha}^{ortho,mod}}{denominator_{\alpha}} \right)$ with $m_{\alpha}^{ortho,mod} = \sum\limits_{\beta \in A_{i}} m_{\beta}^{ortho}$ and $denominator_{\alpha} = \sum\limits_{\beta \in A_{i}} \exp\left( -\lambda_{\beta}^{ortho} \right) \cdot M_{\beta}^{ortho}$.

[0001] The invention relates to a method of setting a free parameter λ_(α)^(ortho)

[0002] of an attribute α in a maximum-entropy speech model, if this free parameter cannot be set with the help of a training algorithm that has been executed previously.

[0003] The invention further relates to a training device and a speech recognition system in which such a method is used.

[0004] The starting point for the construction of a conventional speech model, as used in a computer-aided speech recognition system to recognize speech input, is a predefined training task. The training task models certain statistical samples in the speech of a future user of the speech recognition system in a system of mathematically formulated boundary conditions, which in general has the following form: $\begin{matrix} {{\sum\limits_{({h,w})}{\frac{N(h)}{N} \cdot {p_{\lambda^{ortho}}\left( w \middle| h \right)} \cdot {f_{\alpha}^{ortho}\left( {h,w} \right)}}} = m_{\alpha}^{ortho}} & (1) \end{matrix}$

[0005] where:

[0006] $\frac{N(h)}{N}$: the relative frequency of the history h in a training corpus;

[0007] p_(λ^(ortho))(w|h): the probability with which a given word w follows a word sequence h (history);

[0008] α: a predefined attribute in the speech model; f_(α)^(ortho)(h, w)

[0009]  an orthogonalized binary attribute function for the attribute α; and m_(α)^(ortho):

[0010]  a desired boundary value in the system of boundary conditions.

[0011] The superscript “ortho” designates an orthogonalized value.

[0012] The attribute α can, by way of example, designate an individual word, a word sequence, a word class, such as color or verbs, a sequence of word classes or more complex structures.

[0013] The orthogonalized binary attribute function f_(α)^(ortho)(h, w)

[0014] makes, by way of example, a binary decision on whether given words are contained at certain positions in given word sequences h, w.

[0015] For word-based N-gram attributes α the orthogonalized attribute functions are specifically defined as follows: $f_{\alpha}^{ortho}\left( h,w \right) = \begin{cases} 1 & \text{if } \alpha \text{ fits the word sequence } (h,w) \text{ and is also the attribute with the widest range that fits} \\ 0 & \text{otherwise} \end{cases}$
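By way of illustration only, the word-based definition above can be sketched in Python. The sketch assumes that an attribute is represented as a tuple of words, that an attribute “fits” (h, w) when it is a suffix of the word sequence, and that the “widest range” means the longest such tuple. The function names are hypothetical and do not form part of the specification.

```python
from typing import Optional, Sequence, Set, Tuple

Attribute = Tuple[str, ...]


def widest_fitting_attribute(history: Sequence[str], word: str,
                             attributes: Set[Attribute]) -> Optional[Attribute]:
    """Return the longest attribute that is a suffix of (history, word), if any."""
    sequence = tuple(history) + (word,)
    # Try the longest candidate first, so the first hit has the widest range.
    for length in range(len(sequence), 0, -1):
        candidate = sequence[-length:]
        if candidate in attributes:
            return candidate
    return None


def f_ortho(alpha: Attribute, history: Sequence[str], word: str,
            attributes: Set[Attribute]) -> int:
    """Orthogonalized binary attribute function (assumed suffix semantics)."""
    return 1 if widest_fitting_attribute(history, word, attributes) == alpha else 0
```

With the attributes (y, z, w) and (x, y, z, w), the trigram yields 0 for the history (x, y, z) because the quadgram has the wider range, and 1 for a history such as (q, y, z) for which no quadgram is defined.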

[0016] If word and class-based attributes α (or discontinuous N-grams of different discontinuous structures) are used, then these are accordingly subdivided into various attribute groups A_(i). In this case the orthogonalization of the attribute functions takes place in groups: $f_{\alpha}^{ortho}\left( h,w \right) = \begin{cases} 1 & \text{if } \alpha \text{ fits the word sequence } (h,w) \text{ and is also the attribute with the widest range in its attribute group } A_{i} \text{ that fits} \\ 0 & \text{otherwise} \end{cases}$

[0017] The solution of the system of boundary conditions in accordance with formula (1), that is to say, the training object, is constituted by the so-termed maximum-entropy speech model MESM, which gives a suitable solution of the system of boundary conditions in the form of a suitable definition of the probability p(w|h), which reads as follows: $\begin{matrix} {{p_{\lambda^{ortho}}\left( w \middle| h \right)} = {\frac{1}{Z_{\lambda^{ortho}}(h)} \cdot {\exp \left( {\sum\limits_{\alpha}{\lambda_{\alpha}^{ortho} \cdot {f_{\alpha}^{ortho}\left( {h,w} \right)}}} \right)}}} & (2) \end{matrix}$

[0018] where the sum includes all the attributes α predetermined in the MESM; and where apart from the values listed above, the following magnitudes apply:

[0019] Z_(λ^(ortho))(h): a scaling factor;

[0020] λ^(ortho): a set of all orthogonalized free parameters.
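A minimal sketch of formula (2) for word-based N-gram attributes follows. It relies on the property that at most one orthogonalized attribute function equals 1 for a given (h, w), so that the sum in the exponent reduces to the single parameter of the widest fitting attribute. The tuple representation of attributes and the names `widest_fit` and `prob` are assumptions made for illustration.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Attribute = Tuple[str, ...]


def widest_fit(history: Sequence[str], word: str,
               lambdas: Dict[Attribute, float]) -> Optional[Attribute]:
    """Longest attribute (key of `lambdas`) that is a suffix of (history, word)."""
    sequence = tuple(history) + (word,)
    for length in range(len(sequence), 0, -1):
        if sequence[-length:] in lambdas:
            return sequence[-length:]
    return None


def prob(word: str, history: Sequence[str], vocabulary: Sequence[str],
         lambdas: Dict[Attribute, float]) -> float:
    """p(w|h) according to formula (2) with orthogonalized attribute functions."""
    def score(v: str) -> float:
        attribute = widest_fit(history, v, lambdas)
        # With no fitting attribute the exponent is an empty sum, i.e. exp(0) = 1.
        return math.exp(lambdas[attribute]) if attribute is not None else 1.0

    z = sum(score(v) for v in vocabulary)  # scaling factor Z(h)
    return score(word) / z
```

The scaling factor is computed by summing over the whole vocabulary, which is adequate for a toy example but would be the dominant cost in a realistic model.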

[0021] The free parameters λ^(ortho) are adapted so that the formula (2) represents a solution for the system of boundary conditions in accordance with formula (1). This adaptation normally takes place with the help of so-termed training algorithms. An example of such a training algorithm is the so-termed Generalized Iterative Scaling Algorithm (GIS), which is described for orthogonalized attribute functions in: R. Rosenfeld “A maximum-entropy approach to adaptive statistical language modelling”; Computer Speech and Language, 10: 187-228, 1996.

[0022] Once one or more iterative training steps have been executed in the training algorithm, a check can be made in each case on how well the free parameters λ^(ortho) have been set by the training algorithm. This normally takes place in that the λ^(ortho) values set by the training are used in accordance with the following formula (3) as parameters for the calculation of an approximate boundary value M_(α)^(ortho)

[0023] for the desired boundary value m_(α)^(ortho):

$\begin{matrix} {M_{\alpha}^{ortho} = {\sum\limits_{({h,w})}{\frac{N(h)}{N} \cdot {p_{\lambda^{ortho}}\left( w \middle| h \right)} \cdot {f_{\alpha}^{ortho}\left( {h,w} \right)}}}} & (3) \end{matrix}$

[0024] with the magnitudes listed above.
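Formula (3) can be evaluated directly once the history counts N(h), the probability function and the attribute function are available. The following sketch assumes that the probability and the attribute function are passed in as callables and that the corpus size N is supplied by the caller; all names are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

History = Tuple[str, ...]


def approximate_boundary_value(
    history_counts: Dict[History, int],
    corpus_size: int,
    vocabulary: Sequence[str],
    prob: Callable[[str, History], float],
    f_alpha: Callable[[History, str], int],
) -> float:
    """M_alpha = sum over (h, w) of N(h)/N * p(w|h) * f_alpha(h, w)."""
    total = 0.0
    for history, count in history_counts.items():
        for word in vocabulary:
            total += (count / corpus_size) * prob(word, history) * f_alpha(history, word)
    return total
```

Only histories with N(h) > 0 appear in `history_counts`, so unseen histories contribute nothing to the sum. An attribute function that is 0 for every seen history therefore gives M_α = 0 regardless of the parameter values.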

[0025] A comparison of the calculated approximate boundary values M_(α)^(ortho)

[0026] with the desired boundary values m_(α)^(ortho)

[0027] allows a statement to be made about the quality of the setting found for the free parameters λ_(α)^(ortho).

[0028] In the calculation of the approximate boundary value M_(α)^(ortho)

[0029] in accordance with formula (3) for individual attributes α the case may arise that M_(α)^(ortho) = 0.

[0030] This case may arise if for the attribute α in the MESM attributes β with a wider range exist, which include the attribute α or in particular end with this. In that way the attribute α for certain word sequences (h, w) is blocked by the attribute β with the wider range in the sense that f_(α)^(ortho)(h, w) = 0.

[0031] If this is the case for all the unsolved (h, w) in formula (3), then in accordance with (1) the desired orthogonalized boundary value m_(α)^(ortho)

[0032] is also equal to 0. This situation may be summarized by the formula $\begin{matrix} {f_{\alpha}^{ortho}\left( h,w \right) = 0 \quad \text{for all} \quad \left( h,w \right) \in D_{c}} & (4) \end{matrix}$

[0033] with

D_(c) = {(h, w) | N(h) > 0, w ∈ V}

[0034] where

[0035] D_(c): represents a restricted definition range for the probability function p_(λ)(w|h), where all words w from a vocabulary V of the MESM are freely selectable and only so-termed seen histories h can arise, where the seen histories are those that occur at least once in the training corpus of the MESM, that is, for which N(h)>0.
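The restricted definition range D_(c) can be illustrated with a short sketch. It assumes histories of a fixed length and counts a history only where it is followed by a word in the corpus; both choices, like the function names, are assumptions made for the example.

```python
from collections import Counter
from typing import Iterator, Sequence, Tuple

History = Tuple[str, ...]


def count_histories(tokens: Sequence[str], history_length: int) -> Counter:
    """Count N(h) for every history of the given length that is followed by a word."""
    counts: Counter = Counter()
    for start in range(len(tokens) - history_length):
        counts[tuple(tokens[start:start + history_length])] += 1
    return counts


def restricted_domain(counts: Counter,
                      vocabulary: Sequence[str]) -> Iterator[Tuple[History, str]]:
    """Enumerate D_c: every seen history combined with every vocabulary word."""
    for history, count in counts.items():
        if count > 0:
            for word in vocabulary:
                yield history, word
```

For the token sequence a b c a b d and histories of length 2, the seen histories are (a, b), (b, c) and (c, a). The pair ((a, b), c) belongs to D_(c), as does ((a, b), a), although the latter word sequence never occurs in the corpus.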

[0036] If it is found for an attribute α that the orthogonalized approximate boundary value calculated in accordance with formula (3) is M_(α)^(ortho) = 0,

[0037] then it can be concluded that the associated free parameter λ_(α)^(ortho)

[0038] is ambiguous, i.e. defined with a number of possible interpretations; the execution of the training algorithm was then unsuccessful for this parameter λ_(α)^(ortho)

[0039] for the attribute α; the parameter λ_(α)^(ortho)

[0040] can then not be suitably set with the help of the normal training algorithm.

[0041] A free parameter λ_(α)^(ortho)

[0042] that has a number of possible interpretations has the disadvantage that the conditional probability p_(λ)(w|h) calculated on the basis of this in accordance with formula (2), with which a given word w follows an (unseen) history h, is likewise defined with a number of possible interpretations or not at all. This lowers the overall prediction accuracy and efficiency of the corresponding speech model, and thus of a speech recognition system that works on the basis of the MESM.

[0043] Starting from this state of the art it is an object of the present invention to provide a speech recognition system, a training device and a method of setting a free parameter λ_(α)^(ortho)

[0044] of an attribute α in a maximum-entropy speech model MESM for the cases where a previous attempt at setting was unsuccessful with the help of a training algorithm.

[0045] This object is achieved as claimed in patent claim 1 by a method of setting a free orthogonalized parameter λ_(α)^(ortho)

[0046] of an attribute α in a maximum-entropy speech model MESM, if this free parameter could not be set with the help of a training algorithm that has been executed previously, where the attribute α belongs to an attribute group A_(i) from a total of i=1 . . . n attribute groups in the MESM, comprising the following steps:

[0047] a) Replacing a desired orthogonalized boundary value m_(α)^(ortho)

[0048]  for the attribute α with a modified desired orthogonalized boundary value m_(α)^(ortho, mod)

[0049]  with: $m_{\alpha}^{{ortho},{mod}} = {\sum\limits_{\beta \in A_{i}}m_{\beta}^{ortho}}$

[0050] where

[0051] β ∈ A_(i): represents all the attributes β ∈ A_(i) that have a wider range than the attribute α and that end in the attribute α; and m_(β)^(ortho):

[0052]  represents the desired orthogonalized boundary values for the attributes β;

[0053] b) Calculating an expression ‘denominator_(α)’ according to: $denominator_{\alpha} = {\sum\limits_{\beta \in A_{i}}{{\exp \left( {- \lambda_{\beta}^{ortho}} \right)} \cdot M_{\beta}^{ortho}}},$

[0054] where

[0055] β ∈ A_(i): represents all the attributes β ∈ A_(i) that have a wider range than the attribute α and that end in the attribute α; λ_(β)^(ortho):

[0056]  represents the free orthogonalized parameter of the MESM for attribute β; and M_(β)^(ortho):

[0057]  represents the approximate boundary value for the desired orthogonalized boundary value m_(β)^(ortho)

[0058]  for the attribute β;

[0059] and

[0060] c) Calculating the free orthogonalized parameter λ_(α)^(ortho)

[0061]  according to $\lambda_{\alpha}^{ortho} = {\log \left( \frac{m_{\alpha}^{{ortho},{mod}}}{{denominator}_{\alpha}} \right)}$
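Steps a) to c) can be sketched as a single function. The sketch assumes word-based attributes represented as tuples, and it interprets “a wider range than α, ending in α” as a longer tuple that has α as its suffix. The function name, the dictionary arguments and the error handling are illustrative assumptions.

```python
import math
from typing import Dict, Iterable, Tuple

Attribute = Tuple[str, ...]


def set_blocked_parameter(
    alpha: Attribute,
    group: Iterable[Attribute],
    lambdas: Dict[Attribute, float],
    desired: Dict[Attribute, float],
    approximate: Dict[Attribute, float],
) -> float:
    """Return lambda_alpha = log(m_alpha_mod / denominator_alpha)."""
    # Attributes beta of the group that are longer than alpha and end in alpha.
    blockers = [b for b in group
                if len(b) > len(alpha) and b[-len(alpha):] == alpha]
    # Step a): modified desired boundary value.
    m_mod = sum(desired[b] for b in blockers)
    # Step b): denominator from the trained parameters and approximate values.
    denominator = sum(math.exp(-lambdas[b]) * approximate[b] for b in blockers)
    if m_mod <= 0.0 or denominator <= 0.0:
        raise ValueError("no blocking attribute with positive boundary values")
    # Step c): closed-form setting of the free parameter.
    return math.log(m_mod / denominator)
```

For two blocking quadgrams with parameters 0 and log 2 and boundary values 0.1 and 0.2 (desired and approximate values taken as equal), the modified boundary value is 0.3, the denominator is 0.1 + 0.5 · 0.2 = 0.2, and the parameter is set to log 1.5.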

[0062] The thus calculated value for the free parameter λ_(α)^(ortho)

[0063] for the attribute α has only one interpretation, i.e. it is no longer ambiguous. It is adapted such that it approximates well the associated boundary value m_(α)^(ortho,  mod)

[0064] for a restricted problem, i.e. for a reduced number of attributes within the MESM, which no longer have attributes β which have a wider range than the attribute α.

[0065] It is advantageous to use the orthogonalized free parameter λ_(α)^(ortho)

[0066] calculated with the help of the method in accordance with the invention for the calculation of a probability function p_(λ^(ortho))(w|h)

[0067] in accordance with formula (2), because this is better adapted to the text statistics on which the training object is based.

[0068] Further advantageous method steps are the subject of the dependent claims.

[0069] The object in accordance with the invention is further achieved by a training device for training a speech recognition system as well as by a speech recognition system that has such a training device. The advantages of these devices correspond to the advantages as they have been mentioned above for the method.

[0070] A comprehensive description follows of a preferred example of embodiment of the invention with reference to the attached Figure, with this showing a speech recognition system in accordance with the present invention.

[0071] The method in accordance with the invention comprises essentially two steps, that can be summarized as follows:

[0072] i) Selection of all those attributes α which are blocked in the training by attributes β which have a wider range for all (h, w) ∈ D_(c) within the meaning of the above definition.

[0073] ii) Simulation for all these attributes of an application in which the attribute α is used and execution then of an adaptation of λ_(α)^(ortho).

[0074] Use in these simulated applications not of the original, but of the modified, secondary conditions to fix the boundary conditions of the speech model.

[0075] The first step of the method is executed in that all those attributes are identified whose desired orthogonalized boundary values m_(α)^(ortho)

[0076] and whose approximate boundary values M_(α)^(ortho)

[0077] disappear or are equal to 0.
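The selection in this first step amounts to a simple filter over the boundary values. The sketch below assumes both sets of values are held in dictionaries keyed by attribute; the name `select_blocked_attributes` and the optional tolerance for floating-point values are assumptions.

```python
from typing import Dict, List, Tuple

Attribute = Tuple[str, ...]


def select_blocked_attributes(
    desired: Dict[Attribute, float],
    approximate: Dict[Attribute, float],
    tolerance: float = 0.0,
) -> List[Attribute]:
    """Attributes whose desired and approximate boundary values both vanish."""
    return [
        attribute
        for attribute, m in desired.items()
        if abs(m) <= tolerance and abs(approximate[attribute]) <= tolerance
    ]
```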

[0078] The second step of the method comprises a number of sub-steps, in which a generalization is made from seen histories, that is, those histories that are contained in the training corpus of the MESM, to unseen histories, which are not contained in the training corpus. The individual method steps are explained in the following with the example of a trigram attribute α=(y,z,w) in a word-based four-gram MESM.

[0079] 1. For each seen history h=(x,y,z) the trigram attribute α=(y,z,w) is blocked by a quadgram attribute β=(x,y,z,w); here “blocked” means that f_(α)^(ortho)(h, w) = 0,

[0080] because the attributes α and β fit the word sequence (h, w) and because β has a greater range than α. The expression $\frac{N(h)}{N} \cdot {p\left( w \middle| h \right)}$

[0081] therefore makes a contribution to the approximate boundary value M_(β)^(ortho)

[0082] in accordance with formula (3) for the attribute β.

[0083] 2. For an unseen history h′=(x′,y,z) as a rule no quadgram attribute (x′,y,z,w) is defined and therefore α is in this case not blocked. If the training corpus were big enough to contain the history h′, then the term $\frac{N\left( h^{\prime} \right)}{N} \cdot {p\left( w \middle| h^{\prime} \right)}$

[0084] which is dependent on the free parameter λ_(α)^(ortho),

[0085] would be contained in the secondary conditions, that is to say, it would be contained in the approximate boundary value M_(α)^(ortho).

[0086] This is not the case, however.

[0087] 3. In order to simulate a situation where the trigram α is not blocked and in which the parameter λ_(α)^(ortho)

[0088] actually makes a contribution towards calculating the conditional probability of a p(w|h), the following notional experiment is carried out, where “ortho, mod” designates modified orthogonalized magnitudes:

[0089] For each seen history h=(x,y,z) in the training corpus the blocking quadgram attribute β=(x,y,z,w) is removed. Each of these histories h then takes over the function of h′ in sub-item 2.

[0090] As desired, the modified probability p^(mod)(w|h) then depends on the orthogonalized free parameter λ_(α)^(ortho),

[0091] but not on the free parameter λ_(β)^(ortho).

[0092] The attribute function associated with the attribute α then changes from f_(α)^(ortho) = 0

[0093] (for an unrestricted definition range) to $f_{\alpha}^{{ortho},{mod}} = {{\sum\limits_{\beta}f_{\beta}^{ortho}} \neq 0}$

[0094] because all blocking quadgram βs have been removed beforehand.

[0095] The expressions $\frac{N(h)}{N} \cdot {p^{mod}\left( w \middle| h \right)}$

[0096] then make a contribution to the modified orthogonalized approximate boundary value M_(α)^(ortho, mod)

[0097] instead of to the approximate boundary value M_(β)^(ortho).

[0098] The set of secondary conditions is modified:

[0099] a) All secondary conditions associated with the removed quadgram attributes are omitted.

[0100] b) The secondary condition associated with the trigram considered is based on the modified probability and the modified attribute functions.

[0101] As a consequence of this both sides of the formula (1) change:

[0102] The left side changes from M_(α)^(ortho) = 0  to  M_(α)^(ortho, mod).

[0103] The right side changes from $m_{\alpha}^{ortho} = {{0\quad {to}\quad m_{\alpha}^{{ortho},{mod}}} = {\sum\limits_{\beta}\quad m_{\beta}^{ortho}}}$

[0104] because all blocking quadgrams βs have been removed.

[0105] 4. It is now assumed that the set of all seen histories h=(x,y,z) together with the changes referred to corresponds to the set of unseen histories h′ and the applications of λ_(α)^(ortho).

[0106] The parameter λ_(α)^(ortho)

[0107] is now adapted or set such that the secondary condition assigned to it is approximately met.

[0108] 5. In order to actually perform the notional experiment, the dependency of the orthogonalized approximate boundary value M_(α)^(ortho, mod)

[0109]  on the free parameter λ_(α)^(ortho)

[0110]  must be analyzed:

[0111] Initially the original probabilities are compared with the modified ones (as previously: h=(x,y,z), α=(y,z,w) and β=(x,y,z,w)):

p(w|h)=Z ⁻¹(h)exp(λ_(β) ^(ortho))  (5)

p ^(mod)(w|h)=(Z ^(mod)(h))⁻¹exp(λ_(α) ^(ortho) )  (6)

[0112] with the following normalizations: $\begin{matrix} {{Z(h)} = {{\exp \left( \lambda_{\beta}^{ortho} \right)} + {\sum\limits_{v \neq w}\quad {\exp \left( \lambda_{({\ldots \quad,v})}^{ortho} \right)}}}} & (7) \end{matrix}$

$\begin{matrix} {{Z^{mod}(h)} = {{\exp \left( \lambda_{\alpha}^{ortho} \right)} + {\sum\limits_{v \neq w}\quad {\exp \left( \lambda_{({\ldots \quad,v})}^{ortho} \right)}}}} & (8) \end{matrix}$

[0113] where the designation ( . . . , v) designates the most extensive attributes that fit the word sequence (h, v).

[0114] Assuming that the free parameter λ_(α)^(ortho)

[0115] lies close to the free parameter λ_(β)^(ortho)

[0116] or that both exp (λ_(α)^(ortho))

[0117] and exp (λ_(β)^(ortho))

[0118] are significantly smaller than Σ_(v≠w)exp ( . . . ), the modified probability p^(mod) can be calculated as follows: $\begin{matrix} \begin{matrix} {{p^{mod}\left( w \middle| h \right)} = {\left( {Z^{mod}(h)} \right)^{- 1}{\exp \left( \lambda_{\alpha}^{ortho} \right)}}} \\ {\approx {{Z^{- 1}(h)}{\exp \left( \lambda_{\alpha}^{ortho} \right)}}} \\ {= {{\exp \left( {\lambda_{\alpha}^{ortho} - \lambda_{\beta}^{ortho}} \right)} \cdot {p\left( w \middle| h \right)}}} \end{matrix} & (9) \end{matrix}$
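The quality of the approximation in formula (9) can be checked numerically. The following sketch computes the exact modified probability from formulas (6) and (8) and the approximation from formula (9) for given parameter values; the function name and the list `other_lambdas`, which stands for the parameters under the sum over v ≠ w, are assumptions.

```python
import math
from typing import Sequence, Tuple


def modified_probability(lam_alpha: float, lam_beta: float,
                         other_lambdas: Sequence[float]) -> Tuple[float, float]:
    """Return (exact, approximate) values of p_mod(w|h)."""
    rest = sum(math.exp(lam) for lam in other_lambdas)  # sum over v != w
    z = math.exp(lam_beta) + rest                       # formula (7)
    z_mod = math.exp(lam_alpha) + rest                  # formula (8)
    p = math.exp(lam_beta) / z                          # formula (5)
    exact = math.exp(lam_alpha) / z_mod                 # formula (6)
    approximate = math.exp(lam_alpha - lam_beta) * p    # formula (9)
    return exact, approximate
```

The relative error equals |Z^(mod)(h)/Z(h) − 1|. It vanishes when the two parameters coincide and stays small whenever the sum over v ≠ w dominates both exponentials, which are the two conditions named above.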

[0119] When the approximation in accordance with formula (9) is used, the modified orthogonalized approximate boundary value M_(α)^(ortho, mod)

[0120] can easily be derived from the original boundary values M_(β)^(ortho).

[0121] More important, however, is that it is approximately proportional to exp(λ_(α)^(ortho)),

[0122] as shown in the following: $\begin{matrix} \begin{matrix} {M_{(y,z,w)}^{ortho,mod} = \sum\limits_{(h,w)} \frac{N(h)}{N} \cdot p^{mod}\left( w \middle| h \right) \cdot f_{(y,z,w)}^{ortho,mod}\left( h,w \right)} \\ {= \sum\limits_{x} \frac{N(x,y,z)}{N} \cdot p^{mod}\left( w \middle| x,y,z \right)} \\ {\approx \sum\limits_{x} \frac{N(x,y,z)}{N} \cdot \left\lbrack \exp\left( \lambda_{(y,z,w)}^{ortho} - \lambda_{(x,y,z,w)}^{ortho} \right) \cdot p\left( w \middle| x,y,z \right) \right\rbrack} \\ {= \exp\left( \lambda_{(y,z,w)}^{ortho} \right) \cdot \sum\limits_{x} \exp\left( -\lambda_{(x,y,z,w)}^{ortho} \right) \cdot \left\lbrack \frac{N(x,y,z)}{N} \cdot p\left( w \middle| x,y,z \right) \right\rbrack} \\ {= \exp\left( \lambda_{(y,z,w)}^{ortho} \right) \cdot \sum\limits_{x} \exp\left( -\lambda_{(x,y,z,w)}^{ortho} \right) \cdot M_{(x,y,z,w)}^{ortho}} \end{matrix} & (10) \end{matrix}$

[0123] And, finally, equating the orthogonalized approximate boundary value M_(α)^(ortho, mod)

[0124] to the modified orthogonalized desired boundary value m_(α)^(ortho, mod)

[0125] leads to the desired and sought after adaptation of the orthogonalized parameter λ_(α)^(ortho),

[0126] which is then calculated as follows: $\begin{matrix} {{\exp\left( \lambda_{({y,z,w})}^{ortho} \right)} = \frac{m_{({y,z,w})}^{{ortho},{mod}}}{\sum\limits_{x}{{\exp\left( {- \lambda_{({x,y,z,w})}^{ortho}} \right)} \cdot M_{({x,y,z,w})}^{ortho}}}} & (11) \end{matrix}$

[0127] Such a setting of the free parameter λ_(α)^(ortho)

[0128] that used to have a number of possible interpretations allows a calculation of the probability p_(λ) in a training device or a speech recognition system, that better generalizes from the seen histories h to unseen histories h′.

[0129] The FIGURE accompanying the specification shows such a training device 10, which usually serves for training a speech recognition system that uses an MESM for the speech recognition. The training device 10 normally comprises a training unit 12 for training of free parameters λ_(α)^(ortho)

[0130] of the MESM with the help of a training algorithm, such as the GIS training algorithm. As shown in the introduction to the specification, the training of the free parameters λ_(α)^(ortho)

[0131] is not, however, always successful and it may thus happen that individual free parameters λ_(α)^(ortho)

[0132] of the MESM have still not been adapted in the desired manner even after passing through the training algorithm. These are in particular the parameters of those attributes for which the orthogonalized approximate boundary values M_(α)^(ortho)

[0133] calculated in accordance with formula (3) give the value of 0.

[0134] In order to set also these non-adapted free parameters that have a number of possible interpretations to a suitable value, the training device 10 has an optimization unit 14, which receives the parameters that have a number of possible interpretations from the training unit 12 and optimizes them according to the method in accordance with the invention described previously.

[0135] Advantageously, but not necessarily, such a training device 10 forms part of a speech recognition system 100, that carries out speech recognition based on the MESM. 

1. A method of setting a free orthogonalized parameter λ_(α)^(ortho)

of an attribute α in a maximum-entropy speech model MESM, if this free parameter could not be set with the help of a training algorithm executed previously, where the attribute α belongs to an attribute group A_(i) from a total of i=1 . . . n attribute groups in the MESM, the method comprising the following steps: a) Replacing a desired orthogonalized boundary value m_(α)^(ortho)

 for the attribute α with a modified desired orthogonalized boundary value m_(α)^(ortho, mod)

 with: $m_{\alpha}^{{ortho},{mod}} = {\sum\limits_{\beta \in A_{i}}m_{\beta}^{ortho}}$

where β ∈ A_(i): represents all the attributes β ∈ A_(i) that have a wider range than the attribute α and that end in the attribute α; and m_(β)^(ortho):

 represents the desired orthogonalized boundary values for the attributes β; b) Calculating an expression ‘denominator_(α)’ according to: $denominator_{\alpha} = \sum\limits_{\beta \in A_{i}}{{\exp\left( {- \lambda_{\beta}^{ortho}} \right)} \cdot M_{\beta}^{ortho}}$

where β ∈ A_(i): represents all the attributes β ∈ A_(i) that have a wider range than the attribute α and that end in the attribute α; λ_(β)^(ortho):

 represents the free orthogonalized parameter of the MESM for attribute β; and M_(β)^(ortho):

 represents the approximate boundary value for the desired orthogonalized boundary value for the attribute β; and c) Calculating the free orthogonalized parameter λ_(α)^(ortho)

 according to $\lambda_{\alpha}^{ortho} = {\log \left( \frac{m_{\alpha}^{{ortho},{mod}}}{{denominator}_{\alpha}} \right)}$


2. A method as claimed in claim 1, characterized in that the approximate boundary value M_(β)^(ortho)

in step 1b) is calculated according to: $M_{\beta}^{ortho} = {\sum\limits_{({h,w})}\quad {\frac{N(h)}{N} \cdot {p_{\lambda^{ortho}}\left( w \middle| h \right)} \cdot {f_{\beta}^{ortho}\left( {h,w} \right)}}}$

where: N: describes the number of words in a training corpus of the speech model; $\frac{N(h)}{N}\text{:}$

the relative frequency of the word sequence h (history) in the training corpus; and p_(λ^(ortho))(w|h): the probability with which a new given word w follows the previous history h; λ^(ortho): free orthogonalized parameters for all attributes α, β . . . ; f_(β)^(ortho):

the orthogonalized attribute function for the attribute β.
 3. The use of the orthogonalized free parameter λ_(α)^(ortho)

calculated as claimed in method claim 1 for the calculation of a probability function p_(λ^(ortho))(w|h) according to: ${p_{\lambda^{ortho}}\left( w \middle| h \right)} = {\frac{1}{Z_{\lambda^{ortho}}(h)}{{\exp \left( {\sum\limits_{\alpha}\quad {\lambda_{\alpha}^{ortho} \cdot {f_{\alpha}^{ortho}\left( {h,w} \right)}}} \right)}.}}$


4. A training device (10) for training a speech recognition system (100) which system uses a maximum-entropy speech model MESM for speech recognition, the training device comprising a training unit (12) for training free parameters λ_(α)^(ortho)

of the MESM with the help of a training algorithm; characterized by an optimization unit (14) for optimizing those free parameters λ_(α)^(ortho)

from the number of parameters λ_(α)^(ortho)

which could not be set by training in the training unit (12), in accordance with the method as claimed in claim 1.

5. A speech recognition system (100) which carries out speech recognition on the basis of the MESM, comprising a training device (10) as claimed in claim 4.